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Abstract 



We describe a variational approximation method for efficient inference in large-scale 
probabilistic models. Variational methods are deterministic procedures that provide ap- 
proximations to marginal and conditional probabilities of interest. They provide alterna- 
tives to approximate inference methods based on stochastic sampling or search. We describe 
a variational approach to the problem of diagnostic inference in the "Quick Medical Ref- 
erence" (QMR) network. The QMR network is a large-scale probabilistic graphical model 
built on statistical and expert knowledge. Exact probabilistic inference is infeasible in this 
model for all but a small set of cases. We evaluate our variational inference algorithm on a 
large set of diagnostic test cases, comparing the algorithm to a state-of-the-art stochastic 
sampling method. 

1. Introduction 

Probabilistic models have become increasingly prevalent in AI in recent years. Beyond 
the significant representational advantages of probability theory, including guarantees of 
consistency and a naturalness at combining diverse sources of knowledge (Pearl, 1988), 
the discovery of general exact inference algorithms has been principally responsible for the 
rapid growth in probabilistic AI (see, e.g., Lauritzen & Spiegelhalter, 1988; Pearl, 1988; 
Shenoy, 1992). These exact inference methods greatly expand the range of models that can 
be treated within the probabilistic framework and provide a unifying perspective on the 
general problem of probabilistic computation in graphical models. 

Probability theory can be viewed as a combinatorial calculus that instructs us in how 
to merge the probabilities of sets of events into probabilities of composites. The key oper- 
ation is that of marginalization, which involves summing (or integrating) over the values 
of variables. Exact inference algorithms essentially find ways to perform as few sums as 
possible during marginalization operations. In terms of the graphical representation of 
probability distributions — in which random variables correspond to nodes and conditional 
independencies are expressed as missing edges between nodes — exact inference algorithms 
define a notion of "locality" (for example as cliques in an appropriately defined graph), and 
attempt to restrict summation operators to locally defined sets of nodes. 
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While this approach manages to stave off the exponential explosion of exact probabilistic 
computation, such an exponential explosion is inevitable for any calculus that explicitly 
performs summations over sets of nodes. That is, there are models of interest in which 
"local" is overly large (see Jordan, et al., in press). From this point of view, it is perhaps 
not surprising that exact inference is NP-hard (Cooper, 1990). 

In this paper we discuss the inference problem for a particular large-scale graphical 
model, the Quick Medical Reference (QMR) model. ^ The QMR model consists of a com- 
bination of statistical and expert knowledge for approximately 600 significant diseases and 
approximately 4000 findings. In the probabilistic formulation of the model (the QMR-DT), 
the diseases and the findings are arranged in a bi-partite graph, and the diagnosis problem 
is to infer a probability distribution for the diseases given a subset of findings. Given that 
each finding is generally relevant to a wide variety of diseases, the graph underlying the 
QMR-DT is dense, refiecting high-order stochastic dependencies. The computational com- 
plexity of treating these dependencies exactly can be characterized in terms of the size of 
the maximal clique of the "moralized" graph (see, e.g., Dechter, 1998; Lauritzen & Spiegel- 
halter, 1988). In particular, the running time is exponential in this measure of size. For the 
QMR-DT, considering the standardized "clinocopathologic conference" (CPC) cases that 
we discuss below, we find that the median size of the maximal clique of the moralized graph 
is 151.5 nodes. This rules out the use of general exact algorithms for the QMR-DT. 

The general algorithms do not take advantage of the particular parametric form of the 
probability distributions at the nodes of the graph, and it is conceivable that additional 
factorizations might be found that take advantage of the particular choice made by the 
QMR-DT. Such a factorization was in fact found by Heckerman (1989); his "Quickscore 
algorithm" provides an exact inference algorithm that is tailored to the QMR-DT. Unfortu- 
nately, however, the run time of the algorithm is still exponential in the number of positive 
findings. For the CPC cases, we estimate that the algorithm would require an average of 
50 years to solve the inference problem on current computers. 

Faced with the apparent infeasibility of exact inference for large-scale models such as 
the QMR-DT, many researchers have investigated approximation methods. One general 
approach to developing approximate algorithms is to perform exact inference, but to do so 
partially. One can consider partial sets of node instantiations, partial sets of hypotheses, 
and partial sets of nodes. This point of view has led to the development of algorithms for 
approximate inference based on heuristic search. Another approach to developing approx- 
imation algorithms is to exploit averaging phenomena in dense graphs. In particular, laws 
of large numbers tell us that sums of random variables can behave simply, converging to 
predictable numerical results. Thus, there may be no need to perform sums explicitly, either 
exactly or partially. This point of view leads to the variational approach to approximate 
inference. Finally, yet another approach to approximate inference is based on stochastic 
sampling. One can sample from simplified distributions and in so doing obtain information 
about a more complex distribution of interest. We discuss each of these methods in turn. 

Horvitz, Suermondt and Cooper (1991) have developed a partial evaluation algorithm 
known as "bounded conditioning" that works by considering partial sets of node instan- 

1. The acronym "QMR-DT" that we use in this paper refers to the "decision-theoretic" reformulation of 
the QMR by Shwe, et al. (1991). Shwe, et al. replaced the heuristic representation employed in the 
original QMR model (Miller, Fasarie, & Myers, 1986) by a probabilistic representation. 
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tiations. The algorithm is based on the notion of a "cutset"; a subset of nodes whose 
removal renders the remaining graph singly-connected. Efficient exact algorithms exist for 
singly-connected graphs (Pearl, f988). Summing over all instantiations of the cutset, one 
can calculate posterior probabilities for general graphs using the efficient algorithm as a 
subroutine. Unfortunately, however, there are exponentially many such cutset instantia- 
tions. The bounded conditioning algorithm aims at forestalling this exponential growth by 
considering partial sets of instantiations. Although this algorithm has promise for graphs 
that are "nearly singly-connected," it seems unlikely to provide a solution for dense graphs 
such as the QMR-DT. In particular, the median cutset size for the QMR-DT across the 
CPC cases is f06.5, yielding an unmanageably large number of 2^°^-^ cutset instantiations. 

Another approach to approximate inference is provided by "search-based" methods, 
which consider node instantiations across the entire graph (Cooper, f985; Henrion, f99f; 
Peng & Reggia, f987). The general hope in these methods is that a relatively small fraction 
of the (exponentially many) node instantiations contains a majority of the probability mass, 
and that by exploring the high probability instantiations (and bounding the unexplored 
probability mass) one can obtain reasonable bounds on posterior probabilities. The QMR- 
DT search space is huge, containing approximately 2^°° disease hypotheses. If, however, 
one only considers cases with a small number of diseases, and if the hypotheses involving 
a small number of diseases contain most of the high probability posteriors, then it may 
be possible to search a significant fraction of the relevant portions of the hypothesis space. 
Henrion (f99f) was in fact able to run a search-based algorithm on the QMR-DT inference 
problem, for a set of cases characterized by a small number of diseases. These were cases, 
however, for which the exact Quickscore algorithm is efficient. The more general corpus of 
CPC cases that we discuss in the current paper is not characterized by a small number of 
diseases per case. In general, even if we impose the assumption that patients have a limited 
number N of diseases, we cannot assume a priori that the model will show a sharp cutoff 
in posterior probability after disease N . Finally, in high-dimensional search problems it is 
often necessary to allow paths that are not limited to the target hypothesis subspace; in 
particular, one would like to be able to arrive at a hypothesis containing few diseases by 
pruning hypotheses containing additional diseases (Peng & Reggia, f987). Imposing such a 
limitation can lead to failure of the search. 

More recent partial evaluation methods include the "localized partial evaluation" method 
of Draper and Hanks (f994), the "incremental SPI" algorithm of D'Ambrosio (f993), the 
"probabilistic partial evaluation" method of Poole (f997), and the "mini-buckets" algorithm 
of Dechter (f997). The former algorithm considers partial sets of nodes, and the latter three 
consider partial evaluations of the sums that emerge during an exact inference run. These 
are all promising methods, but like the other partial evaluation methods it is yet not clear if 
they restrict the exponential growth in complexity in ways that yield realistic accuracy/time 
tradeoffs in large-scale models such as the QMR-DT.^ 

Variational methods provide an alternative approach to approximate inference. They 
are similar in spirit to partial evaluation methods (in particular the incremental SPI and 
mini-buckets algorithms), in that they aim to avoid performing sums over exponentially 

2. D'Ambrosio (1994) reports "mixed" results using incremental SPI on the QMR-DT, for a somewhat 
more difficult set of cases than Heckerman (1989) and Henrion (1991), but still with a restricted number 
of positive findings. 
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many summands, but they come at the problem from a different point of view. From the 
variational point of view, a sum can be avoided if it contains a sufficient number of terms 
such that a law of large numbers can be invoked. A variational approach to inference 
replaces quantities that can be expected to be the beneficiary of such an averaging process 
with surrogates known as "variational parameters." The inference algorithm manipulates 
these parameters directly in order to find a good approximation to a marginal probability of 
interest. The QMR-DT model turns out to be a particularly appealing architecture for the 
development of variational methods. As we will show, variational methods have a simple 
graphical interpretation in the case of the QMR-DT. 

A final class of methods for performing approximate inference are the stochastic sam- 
pling methods. Stochastic sampling is a large family, including techniques such as rejection 
sampling, importance sampling, and Markov chain Monte Carlo methods (MacKay, 1998). 
Many of these methods have been applied to the problem of approximate probabilistic in- 
ference for graphical models and analytic results are available (Dagum & Horvitz, 1993). 
In particular, Shwe and Cooper (1991) proposed a stochastic sampling method known as 
"likelihood-weighted sampling" for the QMR-DT model. Their results are the most promis- 
ing results to date for inference for the QMR-DT — they were able to produce reasonably 
accurate approximations in reasonable time for two of the difficult CPC cases. We consider 
the Shwe and Cooper algorithm later in this paper; in particular we compare the algorithm 
empirically to our variational algorithm across the entire corpus of CPC cases. 

Although it is important to compare approximation methods, it should be emphasized 
at the outset that we do not think that the goal should be to identify a single champion 
approximate inference technique. Rather, different methods exploit different structural 
features of large-scale probability models, and we expect that optimal solutions will involve 
a combination of methods. We return to this point in the discussion section, where we 
consider various promising hybrids of approximate and exact inference algorithms. 

The general problem of approximate inference is NP-hard (Dagum & Luby, 1993) and 
this provides additional reason to doubt the existence of a single champion approximate 
inference technique. We think it important to stress, however, that this hardness result, 
together with Cooper's (1990) hardness result for exact inference cited above, should not 
be taken to suggest that exact inference and approximate inference are "equally hard." To 
take an example from a related field, there exist large domains of solid and fiuid mechanics 
in which exact solutions are infeasible but in which approximate techniques (finite element 
methods) work well. Similarly, in statistical physics, very few models are exactly solvable, 
but there exist approximate methods (mean field methods, renormalization group methods) 
that work well in many cases. We feel that the goal of research in probabilistic inference 
should similarly be that of identifying effective approximate techniques that work well in 
large classes of problems. 

2. The QMR-DT Network 

The QMR-DT network (Shwe et al., 1991) is a two-level or bi-partite graphical model (see 
Figure 1). The top level of the graph contains nodes for the diseases, and the bottom level 
contains nodes for the findings. 
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There are a number of conditional independence assumptions reflected in the bi-partite 
graphical structure. In particular, the diseases are assumed to be marginally independent. 
(I.e., they are independent in the absence of flndings. Note that diseases are not assumed 
to be mutually exclusive; a patient can have multiple diseases). Also, given the states 
of the disease nodes, the flndings are assumed to be conditionally independent. (For a 
discussion regarding the medical validity and the diagnostic consequences of these and 
other assumptions embedded into the QMR-DT belief network, see Shwe et al., 1991). 

diseases 



dj d 




findings 

Figure 1: The QMR belief network is a two-level graph where the dependencies between 
the diseases and their associated flndings have been modeled via noisy-OR gates. 



To state more precisely the probability model implied by the QMR-DT model, we write 
the joint probability of diseases and flndings as: 



p(/,rf) = PU\d)P{d) 



n 



1 



where d and / are binary (1/0) vectors referring to presence/absence states of the diseases 
and the positive/negative states or outcomes of the flndings, respectively. The conditional 
probabilities P{fi\d) are represented by the "noisy-OR model" (Pearl, 1988): 

p{h = m n ^u- = ^\dj) 
(1 - fto) n (1 - 'i^jY' 



(2) 
(3) 
(4) 



where tTj- is the set of diseases that are parents of the flnding fi in the QMR graph, g^j = 
P{fi = l\dj = 1) is the probability that the disease j, if present, could alone cause the 
flnding to have a positive outcome, and qiQ = P{fi = l|iy) is the "leak" probability, i.e., 
the probability that the flnding is caused by means other than the diseases included in 
the QMR model. In the flnal line, we reparameterize the noisy-OR probability model 
using an exponentiated notation. In this notation, the model parameters are given by 
= -log(l - q,j). 
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3. Inference 

Carrying out diagnostic inference in the QMR model involves computing the posterior 
marginal probabilities of the diseases given a set of observed positive { fi = 1) and negative 
(/j-; = 0) findings. Note that the set of observed findings is considerably smaller than the set 
of possible findings; note moreover (from the bi-partite structure of the QMR-DT graph) 
that unobserved findings have no effect on the posterior probabilities for the diseases. For 
brevity we adopt a notation in which /^^ corresponds to the event fi = 1, and /~ refers 
to fi = (positive and negative findings respectively). Thus the posterior probabilities of 
interest are P{dj\f'^ , f~), where /+ and /~ are the vectors of positive and negative findings. 

The negative findings /~ are benign with respect to the inference problem — they can be 
incorporated into the posterior probability in linear time in the number of associated diseases 
and in the number of negative findings. As we discuss below, this can be seen from the 
fact that the probability of a negative finding in Eq. (4) is the exponential of an expression 
that is linear in the dj. The positive findings, on the other hand, are more problematic. In 
the worst case the exact calculation of posterior probabilities is exponentially costly in the 
number of positive findings (Heckerman, 1989; D'Ambrosio, 1994). Moreover, in practical 
diagnostic situations the number of positive findings often exceeds the feasible limit for 
exact calculations. 

Let us consider the inference calculations in more detail. To find the posterior probability 
P{d\f~^ , f~), we first absorb the evidence from negative findings, i.e., we compute P{d\f~). 
This is just P{f~\d)P{d) with normalization. Since both P{f~\d) and P{d) factorize over 
the diseases (see Eq. (1) and Eq. (2) above), the posterior P{d\f~) must factorize as well. 
The normalization of P{f~\d)P{d) therefore reduces to independent normalizations over 
each disease and can be carried out in time linear in the number of diseases (or negative 
findings). In the remainder of the paper, we concentrate solely on the positive findings as 
they pose the real computational challenge. Unless otherwise stated, we assume that the 
prior distribution over the diseases already contains the evidence from the negative findings. 
In other words, we presume that the updates P{dj) <— P{dj\f~) have already been made. 

We now turn to the question of computing P{dj\f'^), the posterior marginal probability 
based on the positive findings. Formally, obtaining such a posterior involves marginalizing 
P{f'^\d)P{d) across the remaining diseases: 



where the summation is over all the possible configurations of the disease variables other 
than dj (we use the shorthand summation index d \ dj for this). In the QMR model 



P(rf,|/+)^^P(/+|rf)P(rf) 



(5) 



d\dj 



P{f+\d)P{d) has the form: 



P{f+\d)P{d) 



I[pift\d) nm-) 



(6) 



- z 




(7) 
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which follows from Eq. (4) and the fact that P(fl^\d) = 1 — P(f~\d). To perform the 
summation in Eq. (5) over the diseases, we would have to multiply out the terms 1 — e^'^ 
corresponding to the conditional probabilities for each positive finding. The number of 
such terms is exponential in the number of positive findings. While algorithms exist that 
attempt to find and exploit factorizations in this expression, based on the particular pattern 
of observed evidence (cf. Heckerman, 1989; D'Ambrosio, 1994), these algorithms are limited 
to roughly 20 positive findings on current computers. It seems unlikely that there is sufficient 
latent factorization in the QMR-DT model to be able to handle the full CPC corpus, which 
has a median number of 36 positive findings per case and a maximum number of 61 positive 
findings. 

4. Variational Methods 

Exact inference algorithms perform many millions of arithmetic operations when applied to 
complex graphical models such as the QMR-DT. While this proliferation of terms expresses 
the symbolic structure of the model, it does not necessarily express the numeric structure 
of the model. In particular, many of the sums in the QMR-DT inference problem are sums 
over large numbers of random variables. Laws of large numbers suggest that these sums 
may yield predictable numerical results over the ensemble of their summands, and this fact 
might enable us to avoid performing the sums explicitly. 

To exploit the possibility of numerical regularity in dense graphical models we develop 
a variational approach to approximate probabilistic inference. Variational methods are a 
general class of approximation techniques with wide application throughout applied math- 
ematics. Variational methods are particularly useful when applied to highly-coupled sys- 
tems. By introducing additional parameters, known as "variational parameters" — which 
essentially serve as low-dimensional surrogates for the high-dimensional couplings of the 
system — these methods achieve a decoupling of the system. The mathematical machinery 
of the variational approach provides algorithms for finding values of the variational pa- 
rameters such that the decoupled system is a good approximation to the original coupled 
system. 

In the case of probabilistic graphical models variational methods allow us to simplify a 
complicated joint distribution such as the one in Eq. (7). This is achieved via parameter- 
ized transformations of the individual node probabilities. As we will see later, these node 
transformations can be interpreted graphically as delinking the nodes from the graph. 

How do we find appropriate transformations? The variational methods that we consider 
here come from convex analysis (see Appendix 6). Let us begin by considering methods for 
obtaining upper bounds on probabilities. A well-known fact from convex analysis is that 
any concave function can be represented as the solution to a minimization problem: 

f{x) = mm{ex-r{0} (8) 

where /*(^) is the conjugate function of f{x). The function /*(^) is itself obtained as the 
solution to a minimization problem: 

nO = mm{ex-f{x)}. (9) 
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The formal identity of this pair of minimization problems expresses the "duality" of / and 
its conjugate /*. 

The representation of / in Eq. (8) is known as a variational transformation. The pa- 
rameter ^ is known as a variational parameter. If we relax the minimization and fix the the 
variational parameter to an arbitrary value, we obtain an upper bound: 

f{x)<ex-no. (10) 

The bound is better for some values of the variational parameter than for others, and for a 
particular value of ^ the bound is exact. 

We also want to obtain lower bounds on conditional probabilities. A straightforward 
way to obtain lower bounds is to again appeal to conjugate duality and to express func- 
tions in terms of a maximization principle. This representation, however, applies to convex 
functions — in the current paper we require lower bounds for concave functions. Our con- 
cave functions, however, have a special form that allows us to exploit conjugate duality in a 
different way. In particular, we require bounds for functions of the form f{a-\-J2j where 
/ is a concave function, where Zj for i G {1, 2, . . . , ra} are non-negative variables, and where 
a is a constant. The variables Zj in this expression are effectively coupled — the impact of 
changing one variable is contingent on the settings of the remaining variables. We can use 
Jensen's inequality, however, to obtain a lower bound in which the variables are decoupled."^ 
In particular: 

/(«+E^.) = /(« + E^.t) (11) 
> E^./(«+-) (12) 

3 ?J 

where the qj can be viewed as defining a probability distribution over the variables Zj. The 
variational parameter in this case is the probability distribution q. The optimal setting 
of this parameter is given by qj = Zj/^j^Zk. This is easily verified by substitution into 
Eq. (12), and demonstrates that the lower bound is tight. 

4.1 Variational Upper and Lower Bounds for Noisy-OR 

Let us now return to the problem of computing the posterior probabilities in the QMR 
model. Recall that it is the conditional probabilities corresponding to the positive findings 
that need to be simplified. To this end, we write 

P(/+|rf) = i_e-'''0-E,''».'i. =elog(i-e-) (^3^ 

where x = 9iQ + J2j ^ij'^j- Consider the exponent f{x) = log(l — e~^). For noisy-OR, as 
well as for many other conditional models involving compact representations (e.g., logistic 
regression), the exponent f{x) is a concave function of x. Based on the discussion in the 

3. Jensen's inequality, which states that f(a -\- Ij^j) > li fi'^ + ^i)' concave /, where ^ = 1, 
and < gj < 1, is a simple consequence of Eq. (8), where x is taken to be a -|- ^ Qj^j- 
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previous section, we know that there must exist a variational upper bound for this function 
that is linear in x: 

f{x)<^x-r{o (14) 

Using Eq. (9) to evaluate the conjugate function /*(^) for noisy-OR, we obtain: 

r(0 = -eioge+(e+i)iog(e+i) (is) 

The desired bound is obtained by substituting into Eq. (13) (and recalling the definition 

P{f+\d) = e^C^'O+E/'.-i.) (16) 

< ^mo+J2,'^'^d,)-r{i,) ^^^^ 

^ P{ft\d,Q- (18) 

Note that the "variational evidence" P[f^'\d,^i) is the exponential of a term that is linear 
in the disease vector d. Just as with the negative findings, this implies that the variational 
evidence can be incorporated into the posterior in time linear in the number of diseases 
associated with the finding. 

There is also a graphical way to understand the effect of the transformation. We rewrite 
the variational evidence as follows: 

P{f+\d,Q = e«'('''O+E/».'i.)-/*«0 (19) 

= e«»''»o-r(CO]^[eC.''.]''\ (20) 

i 

Note that the first term is a constant, and note moreover that the product is factorized 
across the diseases. Each of the latter factors can be multiplied with the pre-existing 
prior on the corresponding disease (possibly itself modulated by factors from the negative 
evidence). The constant term can be viewed as associated with a delinked finding node fi. 
Indeed, the effect of the variational transformation is to delink the finding node fi from the 
graph, altering the priors of the disease nodes that are connected to that finding node. This 
graphical perspective will be important for the presentation of our variational algorithm — 
we will be able to view variational transformations as simplifying the graph until a point 
at which exact methods can be run. 

We now turn to the lower bounds on the conditional probabilities P[f^'\d). The expo- 
nent f{OiQ + J2j ^ij'^j) the exponential representation is of the form to which we applied 
Jensen's inequality in the previous section. Indeed, since / is concave we need only identify 
the non-negative variables Zj^ which in this case are Oijdj, and the constant a, which is now 
9io. Applying the bound in Eq. (12) we have: 

P{f+\d) = e^C^'O+E/'.-i.) (21) 



> V "lU I (22) 
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once again the bound is linear in the exponent. As in the case of the upper bound, this 
implies that the variational evidence can be incorporated into the posterior distribution in 
time linear in the number of diseases. Moreover, we can once again view the variational 
transformation in terms of delinking the finding node fi from the graph. 

4.2 Approximate Inference for QMR 

In the previous section we described how variational transformations are derived for indi- 
vidual findings in the QMR model; we now discuss how to utilize these transformations in 
the context of an overall inference algorithm. 

Conceptually the overall approach is straightforward. Each transformation involves 
replacing an exact conditional probability of a finding with a lower bound and an upper 
bound: 



Given that such transformations can be viewed as delinking the ith finding node from 
the graph, we see that the transformations not only yield bounds, but also yield a sim- 
plified graphical structure. We can imagine introducing transformations sequentially until 
the graph is sparse enough that exact methods become feasible. At that point we stop 
introducing transformations and run an exact algorithm. 

There is a problem with this approach, however. We need to decide at each step which 
node to transform, and this requires an assessment of the effect on overall accuracy of 
transforming the node. We might imagine calculating the change in a probability of interest 
both before and after a given transformation, and choosing to transform that node that 
yields the least change to our target probability. Unfortunately we are unable to calculate 
probabilities in the original untransformed graph, and thus we are unable to assess the effect 
of transforming any one node. We are unable to get the algorithm started. 

Suppose instead that we work backwards. That is, we introduce transformations for 
all of the findings, reducing the graph to an entirely decoupled set of nodes. We optimize 
the variational parameters for this fully transformed graph (more on optimization of the 
variational parameters below). For this graph inference is trivial. Moreover, it is also easy 
to calculate the effect of reinstating a single exact conditional at one node: we choose to 
reinstate that node which yields the most change. 

Consider in particular the case of the upper bounds (lower bounds are analogous). Each 
transformation introduces an upper bound on a conditional probability P(fl^\d). Thus the 
likelihood of observing the (positive) findings P{ f^) is also upper bounded by its variational 
counterpart -?(/"*" 1^: 




(26) 




(27) 



d 



d 
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We can assess the accuracy of each variational transformation after introducing and op- 
timizing the variational transformations for all the positive findings. Separately for each 
positive finding we replace the variationally transformed conditional probability P(fl^\d, ^i) 
with the corresponding exact conditional P(fl^\d) and compute the difference between the 
resulting bounds on the likelihood of the observations: 

S. = P{f+\0-P{f^\^\^^) (28) 

where P{f~^\^\^i) is computed without transforming the i^^ positive finding. The larger 
the difference 5i is, the worse the i^^ variational transformation is. We should therefore 
introduce the transformations in the ascending order of SiS. Put another way, we should 
treat exactly (not transform) those conditional probabilities whose Si measure is large. 

In practice, an intelligent method for ordering the transformations is critical. Figure 2 
compares the calculation of likelihoods based on the 6i measure as opposed to a method 
that chooses the ordering of transformations at random. The plot corresponds to a repre- 
sentative diagnostic case, and shows the upper bounds on the log-likelihoods of the observed 
findings as a function of the number of conditional probabilities that were left intact (i.e. 
not transformed). Note that the upper bound must improve (decrease) with fewer trans- 
formations. The results are striking — the choice of ordering has a large effect on accuracy 
(note that the plot is on a log-scale). 




2 4 6 8 10 12 
# of exactly treated findings 



Figure 2: The upper bound on the log-likelihood for the delta method of removing trans- 
formations (solid line) and a method that bases the choice on a random ordering 
(dashed line). 

Note also that the curve for the proposed ranking is convex; thus the bound improves 
less the fewer transformations there are left. This is because we first remove the worst 
transformations, replacing them with the exact conditionals. The remaining transforma- 
tions are better as indicated by the delta measure and thus the bound improves less with 
further replacements. 

We make no claims for optimality of the delta method; it is simply a useful heuristic 
that allows us to choose an ordering for variational transformations in a computationally 
efficient way, Note also that our implementation of the method optimizes the variational 
parameters only once at the outset and chooses the ordering of further transformations 
based on these fixed parameters. These parameters are suboptimal for graphs in which 
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substantial numbers of nodes have been reinstated, but we have found in practice that this 
simplified algorithm still produces reasonable orderings. 

Once we have decided which nodes to reinstate, the approximate inference algorithm 
can be run. We introduce transformations at those nodes that were left transformed by the 
ordering algorithm. The product of all of the exact conditional probabilities in the graph 
with the transformed conditional probabilities yields an upper or lower bound on the overall 
joint probability associated with the graph (the product of bounds is a bound). Sums of 
bounds are still bounds, and thus the likelihood (the marginal probability of the findings) 
is bounded by summing across the bounds on the joint probability. In particular, an upper 
bound on the likelihood is obtained via: 

Pin = J2p(f^\d)m < j2Pif^\d,om ^ pu^io m 

d d 

and the corresponding lower bound on the likelihood is obtained similarly: 

Pin = J2P(f^\d)Pid) > J2Pif^\d,9)P{d) ^ P(/+k) (30) 

d d 

In both cases we assume that the graph has been sufficiently simplified by the variational 
transformations so that the sums can be performed efficiently. 

The expressions in Eq. (29) and Eq. (30) yield upper and lower bounds for arbitrary 
values of the variational parameters ^ and q. We wish to obtain the tightest possible bounds, 
thus we optimize these expressions with respect to ^ and q. We minimize with respect to 
^ and maximize with respect to q. Appendix 6 discusses these optimization problems in 
detail. It turns out that the upper bound is convex in the ^ and thus the adjustment of the 
variational parameters for the upper bound reduces to a convex optimization problem that 
can be carried out efficiently and reliably (there are no local minima). For the lower bound 
it turns out that the maximization can be carried out via the EM algorithm. 

Finally, although bounds on the likelihood are useful, our ultimate goal is to approximate 
the marginal posterior probabilities P{dj\f'^). There are two basic approaches to utilizing 
the variational bounds in Eq. (29) and Eq. (30) for this purpose. The first method, which will 
be our emphasis in the current paper, involves using the transformed probability model (the 
model based either on upper or lower bounds) as a computationally efficient surrogate for the 
original probability model. That is, we tune the variational parameters of the transformed 
model by requiring that the model give the tightest possible bound on the likelihood. We 
then use the tuned transformed model as an inference engine to provide approximations to 
other probabilities of interest, in particular the marginal posterior probabilities P{dj\f'^). 
The approximations found in this manner are not bounds, but are computationally efficient 
approximations. We provide empirical data in the following section that show that this 
approach indeed yields good approximations to the marginal posteriors for the QMR-DT 
network. 

A more ambitious goal is to obtain interval bounds for the marginal posterior probabil- 
ities themselves. To this end, let -?(/"*", dj\^) denote the combined event that the QMR-DT 
model generates the observed findings /"*" and that the j^^ disease takes the value dj. These 
bounds follow directly from: 

P{f^,d,) = J2P{f+\d)P{d) < J2P{f+\d,0P{d) = P{f+,d,\0 (31) 

d\dj d\dj 
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where P{f'^\d,^) is a product of upper-bound transformed conditional probabilities and 
exact (untransformed) conditionals. Analogously we can compute a lower hound P{ f~^ , dj\q) 
by applying the lower bound transformations: 

P{f^,d,) = J2P{f+\d)P{d) > J2P{f+\d,q)P{d) = P{f+,d,\q) (32) 

d\dj d\dj 

Combining these bounds we can obtain interval bounds on the posterior marginal proba- 
bilities for the diseases (cf. Draper & Hanks 1994): 

""'^"•"'i'" <pwn< . . y-"'^' ,„ (33) 



P{f*,d,\() + P(f+,d,\q) - ' ■ - P(f+,d,K) + P(f+,d,\q) 

where dj is the binary complement of dj. 



5. Experimental Evaluation 

The diagnostic cases that we used in evaluating the performance of the variational tech- 
niques were cases abstracted from clinocopathologic conference ("CPC") cases. These cases 
generally involve multiple diseases and are considered to be clinically difficult cases. They 
are the cases in which Middleton et al. (1990) did not find their importance sampling method 
to work satisfactorily. 

Our evaluation of the variational methodology consists of three parts. In the first part 
we exploit the fact that for a subset of the CPC cases (4 of the 48 cases) there are a 
sufficiently small number of positive findings that we can calculate exact values of the 
posterior marginals using the Quickscore algorithm. That is, for these four cases we were 
able to obtain a "gold standard" for comparison. We provide an assessment of the accuracy 
and efficiency of variational methods on those four CPC cases. We present variational 
upper and lower bounds on the likelihood as well as scatterplots that compare variational 
approximations of the posterior marginals to the exact values. We also present comparisons 
with the likelihood- weighted sampler of Shwe and Cooper (1991). 

In the second section we present results for the remaining, intractable CPC cases. We 
use lengthy runs of the Shwe and Cooper sampling algorithm to provide a surrogate for the 
gold standard in these cases. 

Finally, in the third section we consider the problem of obtaining interval bounds on 
the posterior marginals. 



5.1 Comparison to Exact Marginals 

Four of the CPC cases have 20 or fewer positive findings (see Table 1), and for these cases 
it is possible to calculate the exact values of the likelihood and the posterior marginals 
in a reasonable amount of time. We used Heckerman's "Quickscore" algorithm (Hecker- 
man 1989) — an algorithm tailored to the QMR-DT architecture — to perform these exact 
calculations. 

Figure 3 shows the log-likelihood for the four tractable CPC cases. The figure also shows 
the variational lower and upper bounds. We calculated the variational bounds twice, with 
differing numbers of positive findings treated exactly in the two cases ("treated exactly" 
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case 


^ of pos. findings 


^ of neg. findings 


1 


20 


14 


2 


10 


21 


3 


19 


19 


4 


19 


33 



Table 1: Description of the cases for which we evaluated the exact posterior marginals. 
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Figure 3: Exact values and variational upper and lower bounds on the log-likelihood 
logP(/"'"|^) for the four tractable CPC cases. In (a) 8 positive findings were 
treated exactly, and in (b) 12 positive findings were treated exactly. 



simply means that the finding is not transformed variationally) . In panel (a) there were 8 
positive findings treated exactly, and in (b) 12 positive findings were treated exactly. As 
expected, the bounds were tighter when more positive findings were treated exactly.'* 

The average running time across the four tractable CPC cases was 26.9 seconds for 
the exact method, 0.11 seconds for the variational method with 8 positive findings treated 
exactly, and 0.85 seconds for the variational method with 12 positive findings treated exactly. 
(These results were obtained on a 433 MHz DEC Alpha computer). 

Although the likelihood is an important quantity to approximate (particularly in appli- 
cations in which parameters need to be estimated), of more interest in the QMR-DT setting 
are the posterior marginal probabilities for the individual diseases. As we discussed in the 
previous section, the simplest approach to obtaining variational estimates of these quan- 
tities is to define an approximate variational distribution based either on the distribution 
P(/"'"|^), which upper-bounds the likelihood, or the distribution -?(/"*" |^), which lower- 
bounds the likelihood. For fixed values of the variational parameters (chosen to provide 
a tight bound to the likelihood), both distributions provide partially factorized approxi- 
mations to the joint probability distribution. These factorized forms can be exploited as 

4. Given that a significant fraction of tlie positive findings are being treated exactiy in tliese simuiations, one 
may wonder wliat if any additional accuracy is due to tlie variational transformations. We address this 
concern later in this section and demonstrate that the variational transformations are in fact responsible 
for a significant portion of the accuracy in these cases. 
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efficient approximate inference engines for general posterior probabilities, and in particular 
we can use them to provide approximations to the posterior marginals of individual diseases. 

In practice we found that the distribution -?(/"*" |0 yielded more accurate posterior 
marginals than the distribution P{f~^\q), and we restrict our presentation to P{f~^\^)- Fig- 
ure 4 displays a scatterplot of these approximate posterior marginals, with panel (a) corre- 



■B0.6 



0.4 0.6 
exact marginals 



(b) 




0.4 0.6 
exact marginals 



Figure 4: Scatterplot of the variational posterior estimates and the exact marginals. In 
(a) 8 positive findings were treated exactly and in (b) 12 positive findings were 
treated exactly. 



sponding to the case in which 8 positive findings were treated exactly and panel (b) the case 
in which 12 positive findings treated exactly. The plots were obtained by first extracting 
the 50 highest posterior marginals from each case using exact methods and then computing 
the approximate posterior marginals for the corresponding diseases. If the approximate 
marginals are in fact correct then the points in the figures should align along the diagonals 
as shown by the dotted lines. We see a reasonably good correspondence — the variational 
algorithm appears to provide a good approximation to the largest posterior marginals. (We 
quantify this correspondence with a ranking measure later in this section). 

A current state-of-the-art algorithm for the QMR-DT is the enhanced version of likelihood- 
weighted sampling proposed by Shwe and Cooper (1991). Likelihood- weighted sampling is 
a stochastic sampling method proposed by Fung and Chang (1990) and Shachter and Peot 
(1990). Likelihood-weighted sampling is basically a simple forward sampling method that 
weights samples by their likelihoods. It can be enhanced and improved by utilizing "self- 
importance sampling" (see Shachter & Peot, 1990), a version of importance sampling in 
which the importance sampling distribution is continually updated to refiect the current 
estimated posterior distribution. Middleton et al. (1990) utilized likelihood-weighted sam- 
pling with self-importance sampling (as well as a heuristic initialization scheme known as 
"iterative tabular Bayes") for the QMR-DT model and found that it did not work sat- 
isfactorily. Subsequent work by Shwe and Cooper (1991), however, used an additional 
enhancement to the algorithm known as 'Markov blanket scoring" (see Shachter & Peot, 
1990), which distributes fractions of samples to the positive and negative values of a node 
in proportion to the probability of these values conditioned on the Markov blanket of the 
node. The combination of Markov blanket scoring and self-importance sampling yielded 
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an effective algorithm.^ In particular, with these modifications in place, Shwe and Cooper 
reported reasonable accuracy for two of the difficult CPC cases. 

We re-implemented the likelihood-weighted sampling algorithm of Shwe and Cooper, 
incorporating the Markov blanket scoring heuristic and self-importance sampling. (We did 
not utilize "iterative tabular Bayes" but instead utilized a related initialization scheme- 
"heuristic tabular Bayes" -also discussed by Shwe and Cooper). In this section we discuss 
the results of running this algorithm on the four tractable CPC cases, comparing to the 
results of variational inference.^ In the following section we present a fuller comparative 
analysis of the two algorithms for all of the CPC cases. 

Likelihood-weighting sampling, and indeed any sampling algorithm, realizes a time- 
accuracy tradeoff — taking additional samples requires more time but improves accuracy. 
In comparing the sampling algorithm to the variational algorithm, we ran the sampling 
algorithm for several different total time periods, so that the accuracy achieved by the 
sampling algorithm roughly covered the range achieved by the variational algorithm. The 
results are shown in Figure 5, with the right-hand curve corresponding to the sampling runs. 
The figure displays the mean correlations between the approximate and exact posterior 
marginals across ten independent runs of the algorithm (for the four tractable CPC cases). 




execution time in seconds 



Figure 5: The mean correlation between the approximate and exact posterior marginals as 
a function of the execution time (in seconds). Solid line: variational estimates; 
dashed line: likelihood-weighting sampling. The lines above and below the sam- 
pling result represent standard errors of the mean based on the ten independent 
runs of the sampler. 

Variational algorithms are also characterized by a time-accuracy tradeoff. In particular, 
the accuracy of the method generally improves as more findings are treated exactly, at 
the cost of additional computation. Figure 5 also shows the results from the variational 
algorithm (the left-hand curve). The three points on the curve correspond to up to 8, 12 and 



5. The initialization method proved to have Uttle effect on the inference results. 

6. We also investigated Gibbs sampling (Pearl, 1988). The results from Gibbs sampling were not as good 
as the results from likelihood-weighted sampling, and we report only the latter results in the remainder 
of the paper. 
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16 positive findings treated exactly. Note that the variational estimates are deterministic 
and thus only a single run was made. 

The figure shows that to achieve roughly equivalent levels of accuracy, the sampling 
algorithm requires significantly more computation time than the variational method. 

Although scatterplots and correlation measures provide a rough indication of the accu- 
racy of an approximation algorithm, they are deficient in several respects. In particular, in 
diagnostic practice the interest is in the ability of an algorithm to rank diseases correctly, 
and to avoid both false positives (diseases that are not in fact significant but are included 
in the set of highly ranked diseases) and false negatives (significant diseases that are omit- 
ted from the set of highly ranked diseases). We defined a ranking measure as follows (see 
also Middleton et al., 1990). Consider a set of the N highest ranking disease hypotheses, 
where the ranking is based on the correct posterior marginals. Corresponding to this set 
of diseases we can find the smallest set of N' approximately ranked diseases that includes 
the N significant ones. In other words, for any N "true positives" an approximate method 
produces N' — N "false positives." Plotting false positives as a function of true positives 
provides a meaningful and useful measure of the accuracy of an approximation scheme. 
To the extent that a method provides a nearly correct ranking of true positives the plot 
increases slowly and the area under the curve is small. When a significant disease appears 
late in the approximate ordering the plot increases rapidly near the true rank of the missed 
disease and the area under the curve is large. 

We also plot the number of "false negatives" in a set of top N highly ranked diseases. 
False negatives refer to the number of diseases, out of the N highest ranking diseases, 
that do not appear in the set of N approximately ranked diseases. Note that unlike the 
previous measure, this measure does not reveal the severity of the misplacements, only their 
frequency. 

With this improved diagnostic measure in hand, let us return to the evaluation of the 
inference algorithms, beginning with the variational algorithm. Figure 6 provides plots of 




Figure 6: (a) Average number of false positives as a function of true positives for the varia- 
tional method (solid lines) and the partially-exact method (dashed line), (b) False 
negatives in the set of top N approximately ranked diseases. In both figures 8 
positive findings were treated exactly. 

the false positives (panel a) and false negatives (panel b) against the true positives for the 
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Figure 7: (a) Average number of false positives as a function of true positives for the varia- 
tional method (solid line) and the partially-exact method (dashed line), (b) False 
negatives in the set of top N approximately ranked diseases. In both figures 12 
positive findings were treated exactly. 



tractable CPC cases. Eight positive findings were treated exactly in the simulation shown 
in this figure. Figure 7 displays the results when 12 positive finding were treated exactly. 

As we noted earlier, 8 and 12 positive findings comprise a significant fraction of the 
total positive findings for the tractable CPC cases, and thus it is important to verify that 
the variational transformations are in fact contributing to the accuracy of the posterior 
approximations above and beyond the exact calculations. We did this by comparing the 
variational method to a method which we call the "partially-exact" method in which the 
posterior probabilities were obtained using only those findings that were treated exactly in 
the variational calculations (i.e., using only those findings that were not transformed). If 
the variational transformations did not contribute to the accuracy of the approximation, 
then the performance of the partially-exact method should be comparable to that of the 
variational method.'' Figure 6 and Figure 7 clearly indicate that this is not the case. The 
difference in accuracy between these methods is substantial while their computational load 
is comparable (about 0.1 seconds on a 433MHz Dec Alpha). 

We believe that the accuracy portrayed in the false positive plots provides a good in- 
dication of the potential of the variational algorithm for providing a practical solution to 
the approximate inference problem for the QMR-DT. As the figures show, the number of 
false positives grows slowly with the number of true positives. For example, as shown in 
Figure 6 where eight positive findings are treated exactly, to find the 20 most likely diseases 
we would only need to entertain the top 23 diseases in the list of approximately ranked 
diseases (compared to more than 50 for the partially-exact method). 

The ranking plot for the likelihood-weighted sampler is shown in Figure 8, with the 
curve for the variational method from Figure 7 included for comparison. To make these 
plots, we ran the likelihood-weighted sampler for an amount of time (6.15 seconds) that was 

7. It should be noted that this is a conservative comparison, because the partially-exact method in fact 
benefits from the variational transformation — the set of exactly treated positive findings is selected on 
the basis of the accuracy of the variational transformations, and these accuracies correlate with the 
diagnostic relevance of the findings. 
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Figure 8: Average number of false positives as a function of true positives for the likelihood- 
weighted sampler (dashed line) and the variational method (solid line) with 12 
positive findings treated exactly. 



comparable to the time allocated to our slowest variational method (3.17 seconds; this was 
the case in which 16 positive findings were treated exactly. Recall that the time required 
for the variational algorithm with 12 positive findings treated exactly was 0.85 seconds.) As 
the plots show, for these tractable CPC cases, the variational method is significantly more 
accurate than the sampling algorithm for comparable computational loads. 

5.2 The Full CPC Corpus 

We now consider the full CPC corpus. The majority of these cases (44 of 48 cases), have 
more than 20 positive findings and thus appear to be beyond the reach of exact methods. 

An important attraction of sampling methods is the mathematical guarantee of accurate 
estimates in the limit of a sufficiently large sample size (Gelfand & Smith, 1990). Thus 
sampling methods have the promise of providing a general methodology for approximate 
inference, with two caveats: (1) the number of samples that is needed can be difficult to 
diagnosis, and (2) very many samples may be required to obtain accurate estimates. For 
real-time applications, the latter issue can rule out sampling solutions. However, long-term 
runs of a sampler can still provide a useful baseline for the evaluation of the accuracy of faster 
approximation algorithms. We begin by considering this latter possibility in the context of 
likelihood- weighted sampling for the QMR-DT. We then turn to a comparative evaluation 
of likelihood-weighted sampling and variational methods in the time-limited setting. 

To explore the viability of the likelihood-weighted sampler for providing a surrogate for 
the gold standard, we carried out two independent runs each consisting of 400,000 samples. 
Figure 9(a) shows the estimates of the log-likelihood from the first sampling run for all 
of the CPC cases. We also show the variational upper and lower bounds for these cases 
(the cases have been sorted according to the lower bound). Note that these bounds are 
rigorous bounds on the true log-likelihood, and thus they provide a direct indication of the 
accuracy of the sampling estimates. Although we see that many of the estimates lie between 
the bounds, we also see in many cases that the sampling estimates deviate substantially 
from the bounds. This suggests that the posterior marginal estimates obtained from these 
samples are likely to be unreliable as well. Indeed, Figure 9(b) presents a scatterplot of 
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Figure 9: (a) Upper and lower bounds (solid lines) and the corresponding sampling esti- 
mates (dashed line) of the log-likelihood of observed findings for the CPC cases, 
(b) A correlation plot between the posterior marginal estimates from two inde- 
pendent sampling runs. 



estimated posterior marginals for the two independent runs of the sampler. Although we 
see many cases in which the results lie on the diagonal, indicating agreement between the 
two runs, we also see many pairs of posterior estimates that are far from the diagonal. 

These results cast some doubt on the viability of the likelihood-weighted sampler as a 
general approximator for the full set of CPC cases. Even more problematically we appear 
to be without a reliable surrogate for the gold standard for these cases, making it difficult 
to evaluate the accuracy of real-time approximations such as the variational method. Note, 
however, that the estimates in Figure 9(a) seem to fall into two classes — estimates that 
lie within the variational bounds and estimates that are rather far from the bounds. This 
suggests the possibility that the distribution being sampled from is multi-modal, with some 
estimates falling within the correct mode and providing good approximations and with 
others falling in spurious modes and providing seriously inaccurate approximations. If the 
situation holds, then an accurate surrogate for the gold standard might be obtained by using 
the variational bounds to filter the sampling results and retaining only those estimates that 
lie between the bounds given by the variational approach. 

Figure 10 provides some evidence of the viability of this approach. In 24 out of the 48 
CPC cases both of the independent runs of the sampler resulted in estimates of the log- 
likelihood lying approximately within the variational bounds. We recomputed the posterior 
marginal estimates for these selected cases and plotted them against each other in the figure. 
The scatterplot shows a high degree of correspondence of the posterior estimates in these 
cases. We thus tentatively assume that these estimates are accurate enough to serve as a 
surrogate gold standard and proceed to evaluate the real-time approximations. 

Figure 11 plots the false positives against the true positives on the 24 selected CPC 
cases for the variational method. Twelve positive findings were treated exactly in this 
simulation. Obtaining the variational estimates took 0.29 seconds of computer time per 
case. Although the curve increases more rapidly than with the tractable CPC cases, the 
variational algorithm still appears to provide a reasonably accurate ranking of the posterior 
marginals, within a reasonable time frame. 



310 



Variational Probabilistic Inference and QMR-DT 




sampling estimates 1 



Figure 10: A correlation plot between the selected posterior marginal estimates from two 
independent sampling runs, where the selection was based on the variational 
upper and lower bounds. 




Figure 11: Average number of false positives as a function of true positives for the vari- 
ational method (solid line) and the likelihood- weighted sampler (dashed line). 
For the variational method 12 positive findings were treated exactly, and for the 
sampler the results are averages across ten runs. 



To compare the variational algorithm to a time-limited version of the likelihood- weighted 
sampler we ran the latter algorithm for a period of time (8.83 seconds per case) roughly com- 
parable to the running time of the variational algorithm (0.29 seconds per case). Figure 11 
shows the corresponding plot of false positives against true positives, where we have aver- 
aged over ten independent runs. We see that the curve increases significantly more steeply 
than the variational curve. To find the 20 most likely diseases with the variational method 
we would only need to entertain the top 30 diseases in the list of approximately ranked 
diseases. For the sampling method we would need to entertain the top 70 approximately 
ranked diseases. 

5.3 Interval Bounds on the Marginal Probabilities 

Thus far we have utilized the variational approach to produce approximations to the poste- 
rior marginals. The approximations that we have discussed originate from upper and lower 
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bounds on the likelihood, but they are not themselves bounds. That is, they are not guar- 
anteed to lie above or below the true posteriors, as we see in Figure 4. As we discussed in 
Section 4.1, however, it is also possible to induce upper and lower bounds on the posterior 
marginals from upper and lower bounds on the likelihood (cf. Eq. 33). In this section we 
evaluate these interval bounds for the QMR-DT posterior marginals. 

Figure 12 displays histogram of the interval bounds for the four tractable CPC cases, the 
24 selected CPC cases from the previous section, and all of the CPC cases. These histograms 
include all of the diseases in the QMR-DT network. In the case of the tractable cases the 




Figure 12: Histograms of the size of the interval bounds on all of the diseases in the QMR- 
DT network for (a) the four tractable CPC cases, (b) the 24 selected CPC cases 
from the previous section, and (c) all of the CPC cases. 



variational method was run with 12 positive findings treated exactly. For the remaining 
CPC cases the variational method was run with 16 positive findings treated exactly. The 
running time of the algorithm was less than 10 seconds of computer time per CPC case. 

For the tractable CPC cases, the interval bounds are tight for nearly all of the diseases 
in the network. However, (1) few of the positive findings are treated variationally in these 
cases, and (2) there is no need in practice to compute variational bounds for these cases. 
We get a somewhat better picture of the viability of the variational interval bounds in 
Figure 12(b) and Figure 12(c), and the picture is decidedly mixed. For the 24 selected 
cases, tight bounds are provided for approximately half of the diseases. The bounds are 
vacuous for approximately a quarter of the diseases, and there are a range of diseases in 
between. When we consider all of the CPC cases, approximately a third of the bounds are 
tight and nearly half are vacuous. 

Although these results may indicate limitations in our variational approximation, there 
is another more immediate problem that appears to be responsible for the looseness of 
the bounds in many cases. In particular, recall that we use the Quickscore algorithm 
(Heckerman, 1989) to handle the exact calculations within the framework of our variational 
algorithm. Unfortunately Quickscore suffers from vanishing numerical precision for large 
numbers of positive findings, and in general we begin to run into numerical problems, 
resulting in vacuous bounds, when 16 positive findings are incorporated exactly into the 
variational approximation. Thus, although it is clearly of interest to run the variational 
algorithm for longer durations, and thereby improve the bounds, we are unable to do so 
within our current implementation of the exact subroutine. 
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While it is clearly worth studying methods other than Quickscore for treating the ex- 
act findings within the variational algorithm, it is also of interest to consider combining 
variational methods with other methods, such as search-based or other partial evaluation 
methods, that are based on intervals. These methods may help in simplifying the posterior 
and obviating the need for improving the exact calculations. 

It is also worth emphasizing the positive aspect of these results and their potential 
practical utility. The previous section showed that the variational method can provide ac- 
curate approximations to the posterior marginals. Combined with the interval bounds in 
this section — which are calculated efficiently — the user can obtain guarantees on approxi- 
mately a third of these approximations. Given the relatively benign rate of increase in false 
positives as a function of true positives (Figure 11), such guarantees may suffice. Finally, 
for diseases in which the bounds are loose there are also perturbation methods available 
(Jaakkola, 1997) that can help to validate the approximations for these diseases. 

6. Discussion 

Let us summarize the variational inference method and evaluate the results that we have 
obtained. 

The variational method begins with parameterized upper and lower bounds on the indi- 
vidual conditional probabilities at the nodes of the model. For the QMR-DT, these bounds 
are exponentials of linear functions, and introducing them into the model corresponds to 
delinking nodes from the graph. Sums of products of these bounds yield bounds, and thus 
we readily obtain parameterized bounds on marginal probabilities, in particular upper and 
lower bounds on the likelihood. 

We exploited the likelihood bounds in evaluating the output of the likelihood-weighted 
sampling algorithm. Although the sampling algorithm did not yield reliable results across 
the corpus of CPC cases, when we utilized the variational upper and lower bounds to select 
among the samples we were able to obtain sampling results that were consistent between 
runs. This suggests a general procedure in which variational bounds are used to assess the 
convergence of a sampling algorithm. (One can also imagine a more intimate relationship 
between these algorithms in which the variational bounds are used to adjust the on-line 
course of the sampler). 

The fact that we have bounds on the likelihood (or other marginal probabilities) is 
critical — the bounding property allows us to find optimizing values of the variational pa- 
rameters by minimizing the upper-bounding variational distribution and maximizing the 
lower-bounding variational distribution. In the case of the QMR-DT network (a bipar- 
tite noisy-OR graph), the minimization problem is a convex optimization problem and the 
maximization problem is solved via the EM algorithm. 

Once the variational parameters are optimized, the resulting variational distribution can 
be exploited as an inference engine for calculating approximations to posterior probabilities. 
This technique has been our focus in the paper. Graphically, the variationally transformed 
model can be viewed as a sub-graph of the original model in which some of the finding 
nodes have been delinked. If a sufficient number of findings are delinked variationally 
then it is possible to run an exact algorithm on the resulting graph. This approach yields 
approximations to the posterior marginals of the disease nodes. 
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We found empirically that these approximations appeared to provide good approxima- 
tions to the true posterior marginals. This was the case for the tractable set of CPC cases 
(cf. Figure 7) and — subject to our assumption that we have obtained a good surrogate for 
the gold standard via the selected output of the sampler — also the case for the full CPC 
corpus (cf. Figure 11). 

We also compared the variational algorithm to a state-of-the-art algorithm for the QMR- 
DT, the likelihood- weighted sampler of Shwe and Cooper (1991). We found that the varia- 
tional algorithm outperformed the likelihood-weighted sampler both for the tractable cases 
and for the full corpus. In particular, for a fixed accuracy requirement the variational algo- 
rithm was significantly faster (cf. Figure 5), and for a fixed time allotment the variational 
algorithm was significantly more accurate (cf. Figure 8 and Figure 11). 

Our results were less satisfactory for the interval bounds on the posterior marginals. 
Across the full CPC corpus we found that for approximately one third of the disease the 
bounds were tight but for half of the diseases the bounds were vacuous. A major impediment 
to obtaining tighter bounds appears to lie not in the variational approximation per se but 
rather in the exact subroutine, and we are investigating exact methods with improved 
numerical properties. 

Although we have focused in detail on the QMR-DT model in this paper, it is worth 
noting that the variational probabilistic inference methodology is considerably more general. 
Specifically, the methods that we have described here are not limited to the bi-partite 
graphical structure of the QMR-DT model, nor is it necessary to employ noisy-OR nodes 
(Jaakkola & Jordan, 1996). It is also the case that the type of transformations that we 
have exploited in the QMR-DT setting extend to a larger class of dependence relations 
based on generalized linear models (Jaakkola, 1997). Finally, for a review of applications of 
variational methods to a variety of other graphical model architectures, see Jordan, et al. 
(1998). 

A promising direction for future research appears to be in the integration of various 
kinds of approximate and exact methods (see, e.g., Dagum & Horvitz, 1992; Jensen, Kong, 
& Kjaerulff, 1995). In particular, search-based methods (Cooper, 1985; Peng & Reggia, 
1987, Henrion, 1991) and variational methods both yield bounds on probabilities, and, as 
we have indicated in the introduction, they seem to exploit different aspects of the struc- 
ture of complex probability distributions. It may be possible to combine the bounds from 
these algorithm — the variational bounds might be used to guide the search, or the search- 
based bounds might be used to aid the variational approximation. Similar comments can 
be made with respect to localized partial evaluation methods and bounded conditioning 
methods (Draper & Hanks, 1994; Horvitz, et al., 1989). Also, we have seen that variational 
bounds can be used for assessing whether estimates from Monte Carlo sampling algorithms 
have converged. A further interesting hybrid would be a scheme in which variational ap- 
proximations are refined by treating them as initial conditions for a sampler. 

Even without extensions our results in this paper appear quite promising. We have 
presented an algorithm which runs in real time on a large-scale graphical model for which 
exact algorithms are in general infeasible. The results that we have obtained appear to 
be reasonably accurate across a corpus of difficult diagnostic cases. While further work 
is needed, we believe that our results indicate a promising role for variational inference in 
developing, critiquing and exploiting large-scale probabilistic models such as the QMR-DT. 
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Appendix A. Duality 

The upper and lower bounds for individual conditional probability distributions that form 
the basis of our variational method are based on the "dual" or "conjugate" representations 
of convex functions. We present a brief description of convex duality in this appendix, and 
refer the reader to Rockafellar (1970) for a more extensive treatment. 

Let f{x) be a real-valued, convex function defined on a convex set X (for example, 
X = i?"). For simplicity of exposition, we assume that / is a well-behaved (differentiable) 
function. Consider the graph of /, i.e., the points {x,f{x)) in an ra -|- 1 dimensional space. 
The fact that the function / is convex translates into convexity of the set {{x,y) : y > f{x)} 
called the epigraph of / and denoted by epi{f) (Figure 13). It is an elementary property 




X ^' - y - f*(^') < 0,,"" X ^ - y - f*® < 

x^-y-|.i<0 



Figure 13: Half-spaces containing the convex set epi{f). The conjugate function /*(^) 
defines the critical half-spaces whose intersection is epi{f), or, equivalently, it 
defines the tangent planes of f{x). 



of convex sets that they can be represented as the intersection of all the half-spaces that 
contain them (see Figure 13). Through parameterizing these half-spaces we obtain the dual 
representations of convex functions. To this end, we define a half-space by the condition: 

all {x, y) such that x'^^ — y — 1^ 1^ Q (34) 

where ^ and parameterize all (non-vertical) half-spaces. We are interested in character- 
izing the half-spaces that contain the epigraph of /. We require therefore that the points 
in the epigraph must satisfy the half-space condition: for {x,y) G epi{f), we must have 
x'^^ — y — fj, < 0. This holds whenever x'^^ — f[x) — fi < as the points in the epigraph have 
the property that y > f{x). Since the condition must be satisfied by all x £ X , it follows 
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that 



max{ x'^^ — f{x) — /U } < 0, 



(35) 



as well. Equivalently, 



1^ > max{ x'^^ — f{x) } 



(36) 



where the right-hand side of this equation defines a function of ^, which is known as the 
"dual" or "conjugate" function /*(^). This function, which is also a convex function, defines 
the critical half-spaces which are needed for the representation of epi{f) as an intersection 
of half-spaces (Figure 13). 

To clarify the duality between f{x) and f*{x), let us drop the maximum and rewrite 
the inequality as: 



In this equation, the roles of the two functions are interchangeable and we may suspect that 
also f{x) can obtained from the dual function f*{x) by an optimization procedure. This is 
in fact the case and we have: 



This equality states that the dual of the dual gives back the original function. It provides 
the computational tool for calculating dual functions. 

For concave (convex down) functions the results are analogous; we replace max with 
min, and lower bounds with upper bounds. 

Appendix B. Optimization of the Variational Parameters 

The variational method that we have described involves replacing selected local conditional 
probabilities with either upper-bounding or lower-bounding variational transformations. 
Because the product of bounds is a bound, the variationally transformed joint probability 
distribution is a bound (upper or lower) on the true joint probability distribution. More- 
over, because sums of bounds is a bound on the sum, we can obtain bounds on marginal 
probabilities by marginalizing the variationally transformed joint probability distribution. 
In particular, this provides a method for obtaining bounds on the likelihood (the marginal 
probability of the evidence). 

Note that the variationally transformed distributions are bounds for arbitrary values of 
the variational parameters (because each individually transformed node conditional prob- 
ability is a bound for arbitrary values of its variational parameter). To obtain optimizing 
values of the variational parameters, we take advantage of the fact that our transformed 
distribution is a bound, and either minimize (in the case of upper bounds) or maximize 
(in the case of lower bounds) the transformed distribution with respect to the variational 
parameters. It is this optimization process which provides a tight bound on the marginal 
probability of interest (e.g., the likelihood) and thereby picks out a particular variational 
distribution that can subsequently be used for approximate inference. 




(37) 
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In this appendix we discuss the optimization problems that we must solve in the case 
of noisy-OR networks. We consider the upper and lower bounds separately, beginning with 
the upper bound. 

Upper Bound Transformations 

Our goal is to compute a tight upper bound on the likelihood of the observed findings: 
P{f'^) = J2d P'if^\d)P{d) ■ As discussed in Section 4.2, we obtain an upper bound on 
P{f'^\d) by introducing upper bounds for individual node conditional probabilities. We 
represent this upper bound as P{f~^\d,^), which is a product across the individual varia- 
tional transformations and may contain contributions due to findings that are being treated 
exactly (i.e., are not transformed). Marginalizing across d we obtain a bound: 



P{f+)<J2Pif^\d,0P{d)^P{f+\0 

d 



(39) 



It is this latter quantity that we wish to minimize with respect to the variational parameters 

To simplify the notation we assume that the first m positive findings have been trans- 
formed (and therefore need to be optimized) while the remaining conditional probabilities 
will be treated exactly. In this notation P{f~^\^) is given by 



p{f^\o = E 



liPiftld,^. 



liPiftld) 



.«>m 



n pid, 



cx 



P{ I[Pift\d,^.)\, 



(40) 
(41) 



where the expectation is taken with respect to the posterior distribution for the diseases 
given those positive findings that we plan to treat exactly. Note that the proportionality 
constant does not depend on the variational parameters (it is the likelihood of the exactly 
treated positive findings). We now insert the explicit forms of the transformed conditional 
probabilities (see Eq. (17)) into Eq. (41) and find: 



i<im 



(42) 
(43) 



where we have simply converted the products over i into sums in the exponent and pulled 
out the terms that are constants with respect to the expectation. On a log-scale, the 
proportionality becomes an equivalence up to a constant: 



log P{f+\0 =C+J2 fe^^o - r(6)) +logi?(e^. 



(44) 
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Several observations are in order. Recall that f*{^i) is the conjugate of the concave function 
/ (the exponent), and is therefore also concave; for this reason —f*{^i) is convex. In 
Appendix C we prove that the remaining term: 

logE \e^j.'<"-^'^'''^'\ (45) 



is also a convex function of the variational parameters. Now, since any sum of convex 
functions is convex, we conclude that logP(/"'"|^) is a convex function of the variational 
parameters. This means that there are no local minima in our optimization problem. We 
may safely employ the standard Newton-Raphson procedure to solve V log P(/"'" |^) = 0. 
Alternatively we can utilize fixed-point iterations. In particular, we calculate the derivatives 
of the variational form and iteratively solve for the individual variational parameters such 
that the derivatives are zero. The derivatives are given as follows: 

AiogP(/+|e) = ^,o + log^ + i?|^^,,.rfj (46) 



d 



2 



(47) 



where the expectation and the variance are with respect to the posterior approximation 
P{d\f'^ ,^), and both derivatives can be computed in time linear in the number of associ- 
ated diseases for the finding. The benign scaling of the variance calculations comes from 
exploiting the special properties of the noisy-OR dependence and the marginal independence 
of the diseases. 

Calculating the expectations in Eq. (7) is exponentially costly in the number of exactly 
treated positive findings. When there are a large number of positive findings, we can have 
recourse to a simplified procedure in which we optimize variational parameters after having 
transformed all or most of the positive findings. While the resulting variational parameters 
are suboptimal, we have found in practice that the incurred loss in accuracy is typically quite 
small. In the simulations reported in the paper, we optimized the variational parameters 
after approximately half of the exactly treated findings had been introduced. (To be precise, 
in the case of 8, 12 and 16 total findings treated exactly, we optimized the parameters after 
4, 8, and 8 findings, respectively, were introduced). 



Lower Bound Transformations 

Mimicking the case of upper bounds, we replace individual conditional probabilities of 
the findings with lower-bounding transformations, resulting in a lower-bounding expression 
P{f~^\d, q) . Taking the product with P{d) and marginalizing over d yields a lower bound 
on the likelihood: 

P{f+) >J2P{f+\d,q)P{d) = P{f+\q). (48) 

d 

We wish to maximize P{f'^\q) with respect to the variational parameters q to obtain the 
tightest possible bound. 
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Our problem can be mapped onto a standard optimization problem in statistics. In 
particular, treating c? as a latent variable, / as an observed variable, and g as a parameter 
vector, the optimization of -?(/"*" |g) (or its logarithm) can be viewed as a standard maximum 
likelihood estimation problem for a latent variable model. It can be solved using the EM 
algorithm (Dempster, Laird, & Rubin, 1977). The algorithm yields a sequence of variational 
parameters that monotonically increase the objective function logP(/"'"|g). Within the EM 
framework, we obtain an update of the variational parameters by maximizing the expected 
complete log-likelihood: 

E { log P{f+\d, q)P{d) } = Y,E [\og P{ft\d, q.\i) ] + constant, (49) 

i 

where cf^"^ denotes the vector of variational parameters before the update, where the con- 
stant term is independent of the variational parameters q and where the expectation is with 
respect to the posterior distribution P[d\f~^ , q°''^) cx P(f~^\d, q°^'^)P[d). Since the variational 
parameters associated with the conditional probabilities P[f^'\d, g.|j) are independent of one 
another, we can maximize each term in the above sum separately. Recalling the form of the 
variational transformation (see Eq. (24)), we have: 



i?{logP(/+|rf,g.|,)} = Y.'mE{d,] 

3 

+no^o) (50) 



which we are to maximize with respect to while keeping the expectations E{dj} fixed. 
This optimization problem can be solved iteratively and monotonically by performing the 
following synchronous updates with normalization: 



g^l, ^ E{dj} 



(51) 



where /' denotes the derivative of /. (The update is guaranteed to be non-negative). 

This algorithm can be easily extended to handle the case where not all the positive 
findings have been transformed. The only new feature is that some of the conditional 
probabilities in the products P[f~^\d,q°^'^) and P[f~^\d,q) have been left intact, i.e., not 
transformed; the optimization with respect to the variational parameters corresponding to 
the transformed conditionals proceeds as before. 

Appendix C. Convexity 

The purpose of this appendix is to demonstrate that the function: 

\ogE{e^i.^<--^'^''^'\ (52) 



is a convex function of the variational parameters ^i. We note first that affine transforma- 
tions do not change convexity properties. Thus convexity in X = J2j i<m^i^ijdj implies 
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convexity in the variational parameters ^. It remains to show that 

logi?{e^} = log^p,e^' =/(X) (53) 

i 

is a convex function of the vector X = {Xi . . .Xn}^; here we have indicated the discrete 
values in the range of the random variable X by Xi and denoted the probability measure 
on such values by pi. Taking the gradient of / with respect to X^ gives: 

^ -/(^) = - (54) 



dXk Pi e ' 

where defines a probability distribution. The convexity is revealed by a positive semi- 
definite Hessian 7i, whose components in this case are 

^fc/ = QXkdx/ ^^^ ^ ^'"'^'^ ~ ^''^^ ^^^^ 
To see that 7i is positive semi-definite, consider 

Z^nZ = J2 QkZl - {J2 IkZk) (E 11^1) = Var{Z} > (56) 

k k I 

where Yar{Z} is the variance of a discrete random variable Z which takes the values Zi 
with probability g^. 
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